GenderPayGap instance example

In [1]:
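from GenderPayGap import *
import pandas as pd  # pd is used below; import it explicitly rather than relying on the star import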
In [2]:
df = pd.read_excel('sample_salary.xls')
df.head()
Out[2]:
   SCALE  TENURE  ZONE  CENTER         GENDER  TYPE       SALARY
0    145       1  WEST  Emp: 36 to 50  FEMALE  FULL-TIME   17515
1    145       1  WEST  Emp: 36 to 50  FEMALE  FULL-TIME   16031
2    145      11  WEST  Emp: 10 to 20  FEMALE  FULL-TIME   21633
3    145       9  WEST  Emp: 10 to 20  FEMALE  FULL-TIME   20001
4    145       2  WEST  Emp: 10 to 20  FEMALE  FULL-TIME   19623
In [3]:
# Shift SCALE so that its minimum is 0
df['SCALE'] = df['SCALE'] - df['SCALE'].min()
In [4]:
example = GenderPayGap(df, 'GENDER', 'SALARY', swap=True)
GENDER MALE FEMALE RawGAP %RawGAP
SALARY 30,802.15 22,748.05 8,054.10 26.15
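The raw gap printed above is consistent with simple arithmetic on the two group means: the male mean minus the female mean, expressed as a share of the male mean. A minimal sketch that reproduces the figures from the reported means (the variable names are illustrative and are not part of the GenderPayGap package):

```python
# Group means copied from the output above.
male_mean = 30802.15
female_mean = 22748.05

# Absolute gap and gap relative to the male mean.
raw_gap = male_mean - female_mean
pct_raw_gap = 100 * raw_gap / male_mean

print(f"RawGAP  = {raw_gap:,.2f}")
print(f"%RawGAP = {pct_raw_gap:.2f}")
```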
In [5]:
example.exploratory_data_analysis()
SCALE
Skew : 2.88
count mean std min 25% 50% 75% max
GENDER
FEMALE 152.00 41.78 34.95 0.00 0.00 55.00 70.00 111.00
MALE 538.00 65.63 38.02 0.00 55.00 55.00 55.00 399.00
TENURE
Skew : 1.28
count mean std min 25% 50% 75% max
GENDER
FEMALE 152.00 5.54 5.37 1.00 2.00 4.00 7.00 41.00
MALE 538.00 12.01 9.21 1.00 4.00 10.00 18.00 49.00
SALARY
Skew : 4.0
count mean std min 25% 50% 75% max
GENDER
FEMALE 152.00 22,748.05 6,621.03 12,025.00 17,987.50 21,979.00 26,072.75 51,186.00
MALE 538.00 30,802.15 13,832.55 12,960.00 24,311.00 27,431.00 32,835.25 162,953.00
count unique top freq
ZONE 690 5 NORTH 257
count unique top freq
CENTER 690 5 Emp: >50 268
count unique top freq
GENDER 690 2 MALE 538
count unique top freq
TYPE 690 2 FULL-TIME 672
Show polynomial plots for numerical variables
In [6]:
example.prepare_data(max_unique=45,column_to_exp='SCALE', exponent=2, drop_original=True)
Identified columns to encode...  ['ZONE', 'CENTER', 'GENDER', 'TYPE']
Included for encoding..........  ZONE
Included for encoding..........  CENTER
Included for encoding..........  TYPE
New column added...............  SCALE**2 ( SCALE raised to the power of 2 )
Original column droped...........  SCALE
New dataframe total columns....  13
0 1 2 3 4
TENURE 1 1 11 9 2
GENDER_FEMALE 1 1 1 1 1
SALARY 17515 16031 21633 20001 19623
ZONE_EAST 0 0 0 0 0
ZONE_NORTH 0 0 0 0 0
ZONE_SOUTH 0 0 0 0 0
ZONE_WEST 1 1 1 1 1
CENTER_Emp: 21 to 35 0 0 0 0 0
CENTER_Emp: 36 to 50 1 1 0 0 0
CENTER_Emp: <10 0 0 0 0 0
CENTER_Emp: >50 0 0 0 0 0
TYPE_PART-TIME 0 0 0 0 0
SCALE**2 0 0 0 0 0
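The log above suggests that prepare_data one-hot encodes the categorical columns, leaving out one reference level per variable, and replaces SCALE with its square. The package internals are not shown in this notebook, so the following is only a sketch of an equivalent transformation in plain pandas. The toy frame, the use of `pd.get_dummies`, and the manual GENDER_FEMALE indicator are all assumptions made for illustration.

```python
import pandas as pd

# Toy data (hypothetical values) with the same kinds of columns as the sample file.
toy = pd.DataFrame({
    "SCALE": [0, 55, 70, 111],
    "TENURE": [1, 11, 9, 2],
    "ZONE": ["WEST", "NORTH", "WEST", "EAST"],
    "GENDER": ["FEMALE", "MALE", "FEMALE", "MALE"],
    "TYPE": ["FULL-TIME", "FULL-TIME", "PART-TIME", "FULL-TIME"],
    "SALARY": [17515, 27431, 21633, 24311],
})

# One-hot encode the categoricals; drop_first removes one reference level each.
encoded = pd.get_dummies(toy, columns=["ZONE", "TYPE"], drop_first=True, dtype=int)

# Build the gender indicator by hand so that FEMALE is the flagged group.
encoded["GENDER_FEMALE"] = (encoded.pop("GENDER") == "FEMALE").astype(int)

# Replace SCALE with its square, mirroring exponent=2 and drop_original=True.
encoded["SCALE**2"] = encoded.pop("SCALE") ** 2

print(encoded.T)
```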
In [7]:
example.select_significant()
Initial columns....................................  13
Constant column added for Ordinary Least Squares regression
Adjusted r-square with original variables .........  0.808176844474361
Variables to drop ( "p-value" > 0.05 ).............  5
Variables dropped:.................................  ['ZONE_EAST', 'ZONE_NORTH', 'CENTER_Emp: 36 to 50', 'CENTER_Emp: <10', 'CENTER_Emp: >50']
Adjuster r-square with significant variables.......  0.8036018924535696
Final variables considered.........................  9
0 1 2 3 4
const 1.00 1.00 1.00 1.00 1.00
TENURE 1.00 1.00 11.00 9.00 2.00
GENDER_FEMALE 1.00 1.00 1.00 1.00 1.00
SALARY 17,515.00 16,031.00 21,633.00 20,001.00 19,623.00
ZONE_SOUTH 0.00 0.00 0.00 0.00 0.00
ZONE_WEST 1.00 1.00 1.00 1.00 1.00
CENTER_Emp: 21 to 35 0.00 0.00 0.00 0.00 0.00
TYPE_PART-TIME 0.00 0.00 0.00 0.00 0.00
SCALE**2 0.00 0.00 0.00 0.00 0.00
In [8]:
example.plot_coefficients()
In [9]:
example.avg_decomposition(width=None, height=None)
Salary decomposition with significant p-values (> 0.05 )
No width and height specified
Out[9]:
                      Value_STD  Value_MIN   Value_MAX  Value_MEAN  Coefficients  Salary_STD  Salary_MEAN
const                      0.00       1.00        1.00        1.00     21,146.06        0.00    21,146.06
SCALE**2              10,063.02       0.00  159,201.00    5,134.84          0.92    9,216.49     4,702.88
TENURE                     8.93       1.00       49.00       10.58        359.36    3,208.27     3,802.92
ZONE_SOUTH                 0.33       0.00        1.00        0.12      8,401.30    2,749.08     1,022.77
TYPE_PART-TIME             0.16       0.00        1.00        0.03     -3,294.54     -525.51       -85.94
GENDER_FEMALE              0.41       0.00        1.00        0.22     -2,233.70     -926.41      -492.06
ZONE_WEST                  0.43       0.00        1.00        0.24     -2,152.28     -914.88      -508.44
CENTER_Emp: 21 to 35       0.39       0.00        1.00        0.19     -2,928.69   -1,152.77      -560.27
Total                 10,073.67       2.00  159,256.00    5,147.21     19,298.42   11,654.28    29,027.91
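The Salary_MEAN column reads as each variable's mean value multiplied by its coefficient, and the Total row matches the overall mean salary of the sample. A short check using only figures printed in this notebook (the group counts and means come from the exploratory output; the variable names are illustrative):

```python
# Salary_MEAN contributions copied from the table above.
contributions = {
    "const": 21146.06,
    "SCALE**2": 4702.88,
    "TENURE": 3802.92,
    "ZONE_SOUTH": 1022.77,
    "TYPE_PART-TIME": -85.94,
    "GENDER_FEMALE": -492.06,
    "ZONE_WEST": -508.44,
    "CENTER_Emp: 21 to 35": -560.27,
}
total_contribution = sum(contributions.values())

# Overall mean salary from the group counts and group means.
n_male, n_female = 538, 152
mean_male, mean_female = 30802.15, 22748.05
overall_mean = (n_male * mean_male + n_female * mean_female) / (n_male + n_female)

print(f"sum of contributions = {total_contribution:,.2f}")
print(f"overall mean salary  = {overall_mean:,.2f}")
```

The two numbers agree up to the rounding of the printed values, which is what an OLS fit with an intercept should give when evaluated at the mean of every regressor.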
In [10]:
example.gap_decomposition(width=None, height=None)
Salary decomposition with significant p-values (> 0.05 )
No width and height specified
Out[10]:
                          MALE    FEMALE  Value_GAP  Coefficients   MALE_PAY  FEMALE_PAY  Salary_GAP  Percentage_GAP
SCALE**2              5,749.64  2,958.75   2,790.89          0.92   5,265.97    2,709.85    2,556.11           48.54
TENURE                   12.01      5.54       6.47        359.36   4,314.94    1,990.64    2,324.30           53.87
GENDER_FEMALE             0.00      1.00        NaN     -2,233.70      -0.00   -2,233.70    2,233.70            -inf
ZONE_SOUTH                0.14      0.04       0.11      8,401.30   1,218.03      331.63      886.40           72.77
TYPE_PART-TIME            0.01      0.07      -0.06     -3,294.54     -42.87     -238.42      195.55         -456.15
const                     1.00      1.00       0.00     21,146.06  21,146.06   21,146.06        0.00            0.00
ZONE_WEST                 0.24      0.23       0.01     -2,152.28    -512.07     -495.59      -16.48            3.22
CENTER_Emp: 21 to 35      0.20      0.16       0.04     -2,928.69    -587.92     -462.42     -125.49           21.34
Total                 5,763.25  2,966.79   2,797.46     19,298.42  30,802.15   22,748.05    8,054.10           26.15
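In the table above, Salary_GAP is MALE_PAY minus FEMALE_PAY for each row, Percentage_GAP is that difference as a share of MALE_PAY, and the row gaps add up to the raw gap. The sketch below re-derives those columns from the printed pay columns. The `-inf` shown for GENDER_FEMALE comes from dividing by a male contribution of zero, so the sketch reports that case as undefined (NaN) instead; that choice is an assumption of this example, not the package's behaviour.

```python
# (MALE_PAY, FEMALE_PAY) per variable, copied from the table above.
rows = {
    "SCALE**2": (5265.97, 2709.85),
    "TENURE": (4314.94, 1990.64),
    "GENDER_FEMALE": (0.0, -2233.70),
    "ZONE_SOUTH": (1218.03, 331.63),
    "TYPE_PART-TIME": (-42.87, -238.42),
    "const": (21146.06, 21146.06),
    "ZONE_WEST": (-512.07, -495.59),
    "CENTER_Emp: 21 to 35": (-587.92, -462.42),
}

salary_gap = {}
pct_gap = {}
for name, (male_pay, female_pay) in rows.items():
    gap = male_pay - female_pay
    salary_gap[name] = gap
    # Share of the male contribution; undefined when that contribution is zero.
    pct_gap[name] = 100 * gap / male_pay if male_pay != 0 else float("nan")
    print(f"{name:<22}{gap:>10,.2f}{pct_gap[name]:>10.2f}")

total_gap = sum(salary_gap.values())
print(f"{'Total':<22}{total_gap:>10,.2f}")
```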
In [11]:
example.gap_summary()
Adjusted r-square.......  0.8036018924535696
GENDER MALE FEMALE RawGAP %RawGAP AdjustedGAP % AdjustedGAP
SALARY 30,802.15 22,748.05 8,054.10 26.15 2,233.70 7.25
Out[11]:
OaxacaB_Two-Fold
MALE_PAY 30,802.15
FEMALE_PAY 22,748.05
RawGAP 8,054.10
FEMALE_PAY_Predicted 24,981.75
Explained_GAP 5,820.40
Unexplained_GAP 2,233.70
R-square_adj 0.80
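The two-fold summary above can be checked by hand. The unexplained part equals the magnitude of the GENDER_FEMALE coefficient from the reduced model, the explained part is the remainder of the raw gap, and the predicted female pay is the observed female mean with the unexplained part added back. A minimal sketch using the printed figures (variable names are illustrative):

```python
# Figures copied from the outputs above.
male_pay = 30802.15
female_pay = 22748.05
gender_coef = -2233.70  # GENDER_FEMALE coefficient in the reduced model

raw_gap = male_pay - female_pay
unexplained_gap = -gender_coef                 # part attributed to gender itself
explained_gap = raw_gap - unexplained_gap      # part attributed to the other variables
female_pay_predicted = female_pay + unexplained_gap
pct_adjusted_gap = 100 * unexplained_gap / male_pay

print(f"Explained_GAP        = {explained_gap:,.2f}")
print(f"Unexplained_GAP      = {unexplained_gap:,.2f}")
print(f"FEMALE_PAY_Predicted = {female_pay_predicted:,.2f}")
print(f"% AdjustedGAP        = {pct_adjusted_gap:.2f}")
```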
In [12]:
example.ols_first.summary2()
Out[12]:
Model: OLS Adj. R-squared: 0.808
Dependent Variable: SALARY AIC: 13907.6682
Date: 2022-09-08 17:20 BIC: 13966.6452
No. Observations: 690 Log-Likelihood: -6940.8
Df Model: 12 F-statistic: 242.9
Df Residuals: 677 Prob (F-statistic): 6.53e-236
R-squared: 0.812 Scale: 3.2590e+07
Coef. Std.Err. t P>|t| [0.025 0.975]
const 23220.1277 1252.0285 18.5460 0.0000 20761.8021 25678.4533
TENURE 362.3815 26.9964 13.4233 0.0000 309.3747 415.3883
GENDER_FEMALE -1976.0477 581.3050 -3.3993 0.0007 -3117.4251 -834.6702
ZONE_EAST -181.8904 1551.2432 -0.1173 0.9067 -3227.7165 2863.9356
ZONE_NORTH -2137.6259 1503.3425 -1.4219 0.1555 -5089.4002 814.1484
ZONE_SOUTH 6823.4378 1570.0954 4.3459 0.0000 3740.5960 9906.2797
ZONE_WEST -3536.9049 1490.4893 -2.3730 0.0179 -6463.4421 -610.3676
CENTER_Emp: 21 to 35 -3752.1336 962.0835 -3.9000 0.0001 -5641.1598 -1863.1073
CENTER_Emp: 36 to 50 -1831.6376 957.1823 -1.9136 0.0561 -3711.0403 47.7652
CENTER_Emp: <10 -277.9501 1065.7298 -0.2608 0.7943 -2370.4831 1814.5828
CENTER_Emp: >50 -666.2423 908.5699 -0.7333 0.4636 -2450.1959 1117.7113
TYPE_PART-TIME -2872.4677 1417.3445 -2.0267 0.0431 -5655.3872 -89.5482
SCALE**2 0.9195 0.0233 39.4652 0.0000 0.8738 0.9653
Omnibus: 123.426 Durbin-Watson: 1.328
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1908.281
Skew: -0.234 Prob(JB): 0.000
Kurtosis: 11.134 Condition No.: 167575
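In the coefficient table above, the t column is each coefficient divided by its standard error. A quick check on three rows, including CENTER_Emp: 36 to 50, whose p-value of 0.0561 puts it just above the 0.05 threshold used in In [7] (the dictionary layout is illustrative):

```python
# (Coef., Std.Err., reported t) copied from the summary above.
rows = {
    "TENURE": (362.3815, 26.9964, 13.4233),
    "GENDER_FEMALE": (-1976.0477, 581.3050, -3.3993),
    "CENTER_Emp: 36 to 50": (-1831.6376, 957.1823, -1.9136),
}

# Recompute t = coefficient / standard error for each row.
t_stats = {name: coef / se for name, (coef, se, _) in rows.items()}

for name, (_, _, reported) in rows.items():
    print(f"{name:<22} t = {t_stats[name]:.4f} (reported {reported})")
```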